MSc. Data Science for Public Policy


IDS Workshop 2022


Introduction to Data Science

Professor : Simon Munzert, PhD.

Students name: Jorge Roa, Augusto Fonseca, Alexander Kraess


🧬 Quanteda


  • Quanteda is an R package for managing and analyzing textual data.
  • Developed by Kenneth Benoit, Kohei Watanabe, and other contributors.
  • Oficial information about the package: Quanteda webpage
  • Also available on CRAN


🌅 What is quanteda and what do we need it for?


  • The world of data has experienced unprecedented growth.

    • Text data has also increased with time, so its analysis and processing represent a great opportunity.
    • Political speeches, texts, social media, messages, digitalization of old texts.
  • Natural Language Processing (NLP)

    • NLP: the way computers read text and imitate human language.

    • We can apply NLP techinques with quanteda: more easy to do research. (Tokenization, Stopwords, and part of speeches)

  • Quanteda is a package that gives you the power of process, wrangle and analyze text in multiple ways.

    • It’s easy to use and the applications that has are enormous.

    • Quantitative and Qualitative Analysis: best of both worlds in one single package.

    • Text analysis: best way to do it.



🏗 What do we need Quanteda for?


A lot of data is in text form, many tools convert audios into text and there is a lot of text data on webpages and social media.

  • Social science:

    • Analysis of political speeches.
    • Theory building and testing through text analysis.



💣 The power of Quanteda allow us to do multiple analysis in different areas 💣

❓❗ What do you need to always remember about quanteda

Three things


📖 1.-Corpus: the original data that will be pre-processed and analyzed.


🛠 2.-Tokens: Tokenization storing the words of our texts for further analysis.


📑 3.-Document Feature Matrix (DFM): helps us analyze and store the features of a text.


  • 📚 Text files: Quanteda uses readtextpackage. We can read .txt, .csv, .tab, .json. files.

    • Even, we can read .pdf, .doc and .docx files.

    • Amazing, right? For our tutorial, we will use txt files.



🧙 Corpus


Important things:


  • 📚 We can create a corpus from:

    • Character vectors c()

    • 🖼 Dataframes that contain one column with a string or a text to be analyzed.

    • ⛔ IMPORTANT ⛔: your string variable of your df must be name as text

    • SimpleCorpus from tm package.



Here you can appreciate with our exercises what we can obtain.

1.-Text: Name of our document. In our case, the names are the episodes titles of HIMYM.

2.-Types: Different types of features that we can wrangle.

3.-Tokens: Number of tokens that our documents have.

4.-Sentences: Number of sentences per document. In our case, TV scripts.

5.-Chapter and No.overall are variables that we added. We will explain that later.



🪆 Tokens


Important things:


Tokens are just characters that segments texts into tokens (mainly words or sentences) by word boundaries.

  • 📚 What a token object contains:

    • Documents and docvars with the split of them into small units: words.
  • 😎 Why tokenization is awesome?

  • You have functions like

    • remove_separators
    • remove_numbers
    • remove_symbols

Here you can appreciate with our exercises what we can obtain.

You can see the words that are separated.



📜 Document Feature Matrix


Important things:

DFM objects are super useful because we can do stats with them and analysis in general.

  • 📜 What a DFM object contains:

    • A matrix is a 2 dimensional array with m rows, and n columns.

    • In a dfm each row represents a document, and each column represents a feature.

    • Enables us to identify the most frequent features of a document.

    • Analyzes text based on the “bag of words” model.


Here you can appreciate with our exercises what we can obtain.

You can see the features.




⌛ Workflow


Source: our amazing classmates from the MDS 2023: Laura Menicacci & Dinah Rabe


🏎 Principal functions of Quanteda


Main parts:


Remember that you can use a pipe%>%for all the functions of the package.

  • First step: corpus(your_dataframe, text, etc) = Creates a corpus object from available sources.

  • Second step: tokens(your_corpus_object) = Construct a tokens object.

  • Third step: dfm(your_token_object) = Construct a sparse document-feature matrix, from a character, corpus, tokens, or even other dfm object.


🧙 Corpus functions


Remember that you can use a pipe%>%for all the functions of the package.

docnames(your_corpus) = rename you docvars.

corpus_subset() = subsets of a corpus that meet certain condition. Like a filter.

corpus_group(your_text_object, dataframe, etc) = Combine documents in a corpus object by a grouping variable.

corpus_trim(your_text_object, dataframe, etc) = Removes sentences from a corpus or a character vector shorter than a specified length.

corpus_segment(your_text_object, dataframe, etc) = Segment corpus text(s) or a character vector, splitting on a pattern match.


🪆 Token functions


Remember that you can use a pipe%>%for all the functions of the package.

tokens() = Construct a tokens object.

  • tokens_select(your_token_obj) = These function select or discard tokens from a tokens object.
    • tokens_remove(your_token_obj) = Same as tokens select, but we remove words, phrases, etc.
    • tokens_keep(your_token_obj) = Same as tokens select, but we keep words, phrases, etc.

tokens_group(your_token_obj) = Combine documents in a tokens object by a grouping variable.

tokens_tolower(your_token_obj) = Convert the features of a tokens object and re-index the types. All to lower cases.


📜 Document Feature Matrix functions


Remember that you can use a pipe %>% for all the functions of the package.

dfm(your_token_obj) = Construct a sparse document-feature matrix.

dfm_lookup(your_token_obj) = Apply a dictionary to a dfm by looking up all dfm features for matches.

dfm_match(your_token_obj) = Match the feature set of a dfm to a specified vector of feature names.

dfm_subset(your_token_obj) = Returns document subsets of a dfm that meet certain condition


📌 Objective


This workshop aims to use the incredible quanteda package to analyze the television series “How I Met Your Mother” and demonstrate many of the quanteda package’s tools. We will explore the characters, identify adjectives, render Wordclouds, network plots and even sentiment analysis.


🎸 How I Met Your Mother



Plot: “Ted has fallen in love. It all started when his best friend, Marshall, drops the bombshell that he plans to propose to longtime girlfriend Lily, a kindergarten teacher. Suddenly, Ted realizes that he had better get a move on if he hopes to find true love. Helping him in the quest is Barney, a friend with endless – often outrageous – opinions, a penchant for suits and a foolproof way to meet women. When Ted meets Robin, he is sure it’s love at first sight, but the affair fizzles into friendship. Voice-over by Bob Saget (”Full House”) tells the story through flashbacks.”

Source: Rotten Tomatoes


🎸 Principal Characters


Ted

Actor: Josh Radnor

Barney

Actor: Neil Patrick Harris

Robin

Actor: Cobin Smulders

Marshall

Actor: Jason Segel

Lily

Actor: Alyson Hannigan


“The story of five friends sitting in their favorite booth at MacLaren’s, their lives unfolding in front of each other, How I Met Your Mother is heartwarming and hilarious at the same time. Some believe that HIMYM is Ted’s story. Others think that it is Marshall and Lily’s story. And there’s a whole school of thought that it’s no one else but Barney’s story. We would like to think that it’s all of their stories because there won’t be a Ted without Barney or a Lily without Marshall, and definitely no Robin without a Ted (and Barney too). That’s how crucial each of the members of this group is, playing a major role in each other’s lives, helping them grow and become what they wanted to be.”

Source: Collider


🎭 Let’s Start


📕 Libraries


These will be the libraries we will use for our analysis. In every line, you will find the purpose of it.

library(readtext) #For import and Handling for Plain and Formatted Text Files.
library(rvest) #For easily Harvest (Scrape) Web Pages.
library(xml2) #For working with XML files using a simple, consistent interface.
library(polite) #For be responsible when scraping data from websites.
library(httr) #Package for working with HTTP organised by HTTP verbs 
library(tidyverse) #Opinionated collection of R packages designed for data science.
library(tidytext) #Functions and supporting data sets to allow conversion of text.
library(quanteda) #OUR PACKAGE for text analysis. 
library(quanteda.textstats) #OUR SUBPACKAGE for text statistics. 
library(quanteda.textplots) #OUR SUBPACKAGE for text plots. 
library(stringr) #Consistent Wrappers for Common String Operations.
library(spacyr) #NLP package that comes from Python that help us classify words.
library(ggsci) #Collection of high-quality color palettes.
library(ggrepel) # ggrepel provides geoms for ggplot2 to repel overlapping text labels
library(RColorBrewer) #Beautifull color palettes.
library(cowplot) #Package to put images in our plots.
library(magick) #Package for save images in our environment
library(gghighlight) #gghighlight() adds direct labels for some geoms.

#Set image
obj_img <- image_read(path = "https://bit.ly/3twmH2Y")

🧞 Web scrap TV shows scripts


We will do a web scraping of our favorite TV show: “How I Met Your Mother.” For the above, we will do web scraping to obtain the scripts of the 208 episodes that the TV show has. We will define the URLs, obtain the information to know if we can do web scraping, and name the directory where we want to save our files.

🧵 Define URLS and read HTML


  • This chunk of code shows how we can retrieve data from the internet. For our purpose, we will use Sprigfield webpage. Here, you can download the original TV scripts from multiple shows; in our case, we will download the How I Met Your Mother scripts.
v_tv_show <- "how-i-met-your-mother"

v_url_web <- "http://www.springfieldspringfield.co.uk/"

#Remember to be polite and know if we can web scrap the webpage
session_information <- bow(v_url_web) #Do a bow with the polite package
session_information

v_url <- paste(v_url_web,"episode_scripts.php?tv-show=", v_tv_show, sep="")

#Identify yourself
rvest_himym <- session(v_url, 
                       add_headers(`From` = "jurjoo@gmail.com", 
                                   `UserAgent` = R.Version()$version.string))

#Start web scrap
html_url_scrape <- rvest_himym %>% read_html(v_url)

node_selector <- ".season-episode-title"

directory_path <- paste("texts/how-i-met-your-mother/", v_tv_show, sep = "")

🪡 Loop for download TV scripts


### scrape href nodes in .season-episode-title-------------------------

html_url_all_seasons <- html_nodes(html_url_scrape, node_selector) %>%
  html_attr("href")

### One loop for all our URL's----------------------------------------

for (x in html_url_all_seasons) {
  read_ur <- read_html(paste(v_url_web, x, sep="/"))
  
  Sys.sleep(runif(1, 0, 1)) #Be polite
  
  # Element node that was checked and that contain the place of the scripts.
  selector <- ".scrolling-script-container"
  # Scrape the text
  text_html <- html_nodes(read_ur, selector) %>% 
    html_text()
  
  # Last five characters of html_url_all_seasons for saving this to separate text files (This is our pattern).
  sub_data <- function(x, n) {
    substr(x, nchar(x) - n + 1, nchar(x))
  }
  seasons_final <- sub_data(x, 5)
  # Write each text file
  write.csv(text_html, file = paste(directory_path, "_", seasons_final, ".txt", sep=""), row.names = FALSE)
}

🎨 Webscrapp TV Show tables


🎭 Information about TV episodes


It’s important to attach our scripts with relevant information about them. For example, episode title, number of episode, number of season, director, etc. That’s why we will webscrap this information from the internet.

url_himym <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_episodes"

rvest_himym_table <- session(url_himym, 
                             add_headers(`From` = "jurjoo@gmail.com", 
                                         `UserAgent` = R.Version()$version.string))

l_tables_himym <- rvest_himym_table %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)

#This generates a list with all the tables that contain the page. In our case, 
#we want the table from the second element till the 10th. 
l_tables_himym <- l_tables_himym[c(2:10)]

l_tables_himym[1]
## [[1]]
## # A tibble: 22 × 8
##    No.overall `No. inseason` Title       Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
##         <int>          <int> <chr>       <chr>   <chr>   <chr>   <chr>   <chr>  
##  1          1              1 "\"Pilot\"" Pamela… Carter… Septem… 1ALH79  10.94[…
##  2          2              2 "\"Purple … Pamela… Carter… Septem… 1ALH01  10.40[…
##  3          3              3 "\"Sweet T… Pamela… Phil L… Octobe… 1ALH02  10.44[…
##  4          4              4 "\"Return … Pamela… Kourtn… Octobe… 1ALH03  9.84[1…
##  5          5              5 "\"Okay Aw… Pamela… Chris … Octobe… 1ALH04  10.14[…
##  6          6              6 "\"Slutty … Pamela… Brenda… Octobe… 1ALH05  10.89[…
##  7          7              7 "\"Matchma… Pamela… Chris … Novemb… 1ALH07  10.55[…
##  8          8              8 "\"The Due… Pamela… Gloria… Novemb… 1ALH06  10.35[…
##  9          9              9 "\"Belly F… Pamela… Phil L… Novemb… 1ALH09  10.29[…
## 10         10             10 "\"The Pin… Pamela… Carter… Novemb… 1ALH08  12.27[…
## # … with 12 more rows, and abbreviated variable names ¹​`Directed by`,
## #   ²​`Written by`, ³​`Original air date`, ⁴​Prod.code, ⁵​`US viewers(millions)`

🤖 Data cleaning to wrangle html tables (Characters of the TV show)


Of course, we must clean our tables to have a final dataframe with the texts and the information of every episode.

#Reduce the list in one data frame since all of the tables share the same structure 
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym)) 


#We do the same for the characters of HIMYM
url_himym_characters <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_characters"

rvest_himym_table_2 <- session(url_himym_characters, 
                               add_headers(`From` = "jurjoo@gmail.com", 
                                           `UserAgent` = R.Version()$version.string))

l_tables_himym_characters <- rvest_himym_table_2 %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)

df_characters <- as.data.frame(l_tables_himym_characters[[1]]) %>% 
  select(Character)

df_characters_w <- df_characters %>% 
  filter(!stringr::str_starts(Character, "Futu"),
         !(Character %in% c("Character", "Main Characters", 
                            "Supporting Characters"))) %>% 
  mutate(name = str_extract(Character,"([^ ]+)"),
         name = replace(name, name == "Dr.", "Sonya"))

df_characters_w

🌪 Data cleaning to wrangle html tables (Information of the TV Show)


Look how with our code we cleaned the information of the TV Show and know we have it in a dataframe.

#We bind the tables with bind_rows
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym)) 

df_himym_filt <- df_himym %>% filter(str_length(No.overall) < 4)

df_himym_filt_dupl <- df_himym %>% filter(str_length(No.overall) > 4)

#We are doing this particular wrangling to format in the best possible way our tables. 

#Note that we are using stringr to manipulate our characters.

df_himym_filt_dupl_1 <- df_himym_filt_dupl %>% 
  mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 1, 3))),
         No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 1, 2))),
         Prod.code = replace (Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 1, 6)))

df_himym_filt_dupl_2 <- df_himym_filt_dupl %>% 
  mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 4, 6))),
         No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 3, 4))),
         Title = replace(Title, Title == "\"The Magician's Code\"", "\"The Magician's Code Part 2\""),
         Title = replace(Title, Title == "\"The Final Page\"", "\"The Final Page Part 2\""),
         Title = replace(Title, Title == "\"Last Forever\"" , "\"Last Forever Part 2\"" ),
         Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 7, 12)))

df_himym_final <- bind_rows(df_himym_filt, 
                            df_himym_filt_dupl_1, 
                            df_himym_filt_dupl_2) %>% 
  arrange(No.overall, No..inseason) %>% 
  mutate(year = str_extract(Original.air.date, '[0-9]{4}+'),
         Season = as.numeric(stringr::str_extract(Prod.code, "^.{1}"))) %>% 
  rename(Chapter = No..inseason)

df_himym_final$US.viewers.millions. <- as.numeric(str_replace_all(df_himym_final$US.viewers.millions., "\\[[0-9]+\\]", ""))

df_himym_final

🎛 Load TV scripts and merge tables


This is the final and most important step of the web scrap. Here, we are merging our TV show scripts and the information of the episodes in one single dataframe.

df_texts_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")

v_season <- as.numeric(stringr::str_extract(df_texts_himym$doc_id, "\\d+"))

v_chapter <- as.numeric(stringi::stri_extract_last_regex(df_texts_himym$doc_id, "[0-9]+"))

df_texts_himym_w <- df_texts_himym %>% mutate(Season = v_season, Chapter = v_chapter)

df_himym_final_doc <- full_join(as.data.frame(df_texts_himym_w), df_himym_final, by = c("Season", "Chapter")) %>% 
  mutate(Season_w = paste("Season", Season),
         Title_season = paste0(Title, " S", Season, " EP", Chapter))

df_himym_final_doc

Press the arrows in the top right corner of this interactive dataframe. As you can see, we have our final dataframe with the information our our TV show, number of season, episode, etc.


🎨 Quanteda (Corpus, Token and DFM)


Once, we have our final dataframe, now we can start our analysis using the quanteda package.

🧥 First step: Define a corpus


Look our corpus, it’s divided into types, tokens and even sentences.

corp_himym <- corpus(df_himym_final_doc)  #Build a new corpus from the texts

docnames(corp_himym) <- df_himym_final_doc$Title #With docnames() we can change the name of our texts.
#In our case, we putted the title of the episodes. 

summary(corp_himym, n = 15)

🥽 Second step: Convert corpus into tokens and wrangle it


Look our tokenization, we separate our text into words. Amazing!

corp_himym_stat <- corp_himym

docnames(corp_himym_stat) <- df_himym_final_doc$Title_season


corp_himym_s1_simil <- corpus_subset(corp_himym_stat, Season == 1) #We want to analyze just the first season



toks_himym_s1 <- tokens(corp_himym_s1_simil, #corpus from all the episodes from the first season
                        remove_punct = TRUE, #Remove punctuation of our texts
                        remove_separators = TRUE, #Remove separators of our texts
                        remove_numbers = TRUE, #Remove numbers of our texts
                        remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_remove(stopwords("english")) #Remove stop words of our texts

toks_himym_s1
## Tokens consisting of 22 documents and 12 docvars.
## "Pilot" S1 EP1 :
##  [1] "x"          "OLDER"      "TED"        "Kids"       "gonna"     
##  [6] "tell"       "incredible" "story"      "story"      "met"       
## [11] "mother"     "punished"  
## [ ... and 1,462 more ]
## 
## "Purple Giraffe" S1 EP2 :
##  [1] "x"            "OLDER"        "TED"          "Okay"         "telling"     
##  [6] "us"           "met"          "Mom"          "excruciating" "detail"      
## [11] "Right"        "back"        
## [ ... and 1,374 more ]
## 
## "Sweet Taste of Liberty" S1 EP3 :
##  [1] "x"     "S"     "Sy"    "Syn"   "Sync"  "Sync"  "b"     "OLDER" "TED"  
## [10] "one"   "night" "met"  
## [ ... and 1,350 more ]
## 
## "Return of the Shirt" S1 EP4 :
##  [1] "x"                  "OLDER"              "TED"               
##  [4] "Kids"               "single"             "looking"           
##  [7] "happily-ever-after" "one"                "stories"           
## [10] "can"                "end"                "way"               
## [ ... and 1,477 more ]
## 
## "Okay Awesome" S1 EP5 :
##  [1] "x"     "OLDER" "TED"   "kids"  "like"  "hear"  "story" "time"  "went" 
## [10] "deaf"  "even"  "ask"  
## [ ... and 1,138 more ]
## 
## "Slutty Pumpkin" S1 EP6 :
##  [1] "x"         "OLDER"     "TED"       "know"      "Aunt"      "Robin's"  
##  [7] "big"       "fan"       "Halloween" "Always"    "dressing"  "crazy"    
## [ ... and 1,405 more ]
## 
## [ reached max_ndoc ... 16 more documents ]

🧩 Third step: Convert our tokens into a Document Feature Matrix


Please, take a look into our Document Feature Matrix. Look know how it is counting our ocurrences. We can do multiple things with them.

toks_himym_dm_s1 <- toks_himym_s1 %>% 
                    dfm() #Convert our tokens into a document feature matrix

toks_himym_dm_s1
## Document-feature matrix of: 22 documents, 4,890 features (87.39% sparse) and 12 docvars.
##                                  features
## docs                              x older ted kids gonna tell incredible story
##   "Pilot" S1 EP1                  1     1  22    3    22    6          1     7
##   "Purple Giraffe" S1 EP2         1     5  30    2    19    1          1     1
##   "Sweet Taste of Liberty" S1 EP3 1     3  27    1    15    6          0     2
##   "Return of the Shirt" S1 EP4    1     5  14    1    15    4          0     6
##   "Okay Awesome" S1 EP5           1     3   9    2    11    4          0     5
##   "Slutty Pumpkin" S1 EP6         1     1  16    0    15    1          0     3
##                                  features
## docs                              met mother
##   "Pilot" S1 EP1                   11      1
##   "Purple Giraffe" S1 EP2          10      0
##   "Sweet Taste of Liberty" S1 EP3   1      1
##   "Return of the Shirt" S1 EP4      0      0
##   "Okay Awesome" S1 EP5             0      0
##   "Slutty Pumpkin" S1 EP6           4      0
## [ reached max_ndoc ... 16 more documents, reached max_nfeat ... 4,880 more features ]

🏆 Quanteda analysis



🎯 Similarity between episodes

textstat_simil() function. It’s super useful because we will find the similarity between episodes for the first season.


tstat_simil <- textstat_simil(toks_himym_dm_s1) #Check similarity between episodes of the first season

clust <- hclust(as.dist(tstat_simil)) #Convert our object into a cluster (For visualization purposes)

dclust <- as.dendrogram(clust)  #Convert our cluster into a dendrogram (For visualization purposes)

dclust <- reorder(dclust, 1:22) #Order our visualization
#Seetle colors
nodePar <- list(lab.cex = 1, pch = c(NA, 19), 
                cex.axis = 1.5,
                cex = 2, col = "#0080ff")

par(mar = c(18.1, 6, 2, 3))

#Plot dendogram
plot(dclust, nodePar = nodePar,
     las = 1,
     cex.axis = 2, cex.main = 2, cex.sub = 2,
     main = "How I Met Your Mother Season 1",
     type = "triangle",
     ylim = c(0,1),
     ylab = "",
     edgePar = list(col = 4:7, lwd = 7:7),
     panel.first = abline(h = c(seq(.10, 1, .10)), col = "grey80"))

title(ylab = "Similarity between episodes (correlation %)", mgp = c(4, 1, 1), cex.lab = 2)    

rect.hclust(clust, k = 5, border = "red")

Look how amazing the similarity it is. The similarity is higher for episodes like “Zip, Zip, Zip” and Cupcake. And, for the episodes that are less similars are “The Pineapple Incident” and “The Limo”.



🪅 Distance between episodes

textstat_dist() function. Here distance is the opposed of similarity. More distance equals less similar. Rememebr our similarity chart? Well, is the same, but here we are obtaining distance.

tstat_dist <- textstat_dist(toks_himym_dm_s1) #Check similarity between episodes of the first season

clust <- hclust(as.dist(tstat_dist)) #Convert our object into a cluster (For visualization purposes)

dclust_dist <- as.dendrogram(clust)  #Convert our cluster into a dendrogram (For visualization purposes)

dclust_dist <- reorder(dclust, 1:22) #Order our visualization
par(mar = c(21, 6, 2, 3))

#Plot dendogram
plot(dclust_dist, nodePar = nodePar_2,
     las = 1,
     cex.axis = 2, cex.main = 2, cex.sub = 2,
     main = "How I Met Your Mother Season 1",
     type = "triangle",
     ylim = c(0, 120),
     ylab = "",
     edgePar = list(col = 11:19, lwd = 7:7),
     panel.first = abline(h = c(seq(10, 120, 10)), col = "grey80"))

title(ylab = "Distance between episodes (correlation %)", mgp = c(4, 1, 1), cex.lab = 2)    

rect.hclust(clust, k = 5, border = "red")

Remember the episodes “The Pineapple Incident” and “The Limo” are the less similar ones? Well, here these episodes are the ones that have more distance between them.


🎎 Appearances of actors by season

Now, we want to know the characters of the TV Show. We will get the number of appearances by actor per season and episode.

#Remember our second step: tokenize our corpus. 

toks_himym <- tokens(corp_himym, #corpus from all the episodes
                     remove_punct = TRUE, #Remove punctuation of our texts
                     remove_separators = TRUE,  #Remove separators of our texts
                     remove_numbers = TRUE, #Remove numbers of our texts
                     remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_remove(stopwords("english")) #Add additional words

dfm_actors <- toks_himym %>% 
  tokens_select(c("Ted", "Marshall", "Lily", "Robin", "Barney", "Mother")) %>% #We just want to analyze these characters
  tokens_group(groups = Season) %>% #We group our tokens (scripts) by season
  dfm() #Transform the token into a DFM object

df_final_actors <-  as.data.frame(textstat_frequency(dfm_actors, groups = c(1:9))) %>% 
                    mutate(Season = paste("Season", group),
                           `Principal Characters` = replace(feature, is.character(feature), str_to_title(feature))) %>% 
                    select(-feature)

df_final_actors


Here, we plot this frequency of actors. There is no secret that Ted is the most famous character of the show because he is the one that is telling the story to his sons. It’s interesting how Barney had a “glow up” for the end of the last season. What is ironic is that the TV Show is called “How I Met Your Mother”, but the times that “Mother” appears on the seasons is not that much.

# Plot frequency of actors
ggplot1 <- ggplot(df_final_actors, aes(x = group, y = frequency, group = `Principal Characters`, color = `Principal Characters`)) +
  geom_line(size = 1.5) +
  scale_color_manual(values = brewer.pal(n = 6, name = "Dark2")) +
  geom_point(size = 3.2) +
  scale_y_continuous(breaks = seq(0, 5600, by = 50), limits = c(0,560))+
  theme_minimal(base_size = 14) +
  labs(x = "Number of Season",
       y = "Frequencies of appreances",
       title = "Appearances of principal characters by Season",
       caption="Description: This plot show the number of times that the \n principal characters appears in HIMYM per season.")+
       theme(panel.grid.major=element_line(colour="#cfe7f3"),
             panel.grid.minor=element_line(colour="#cfe7f3"),
             plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
             #axis.text.x=element_text(size=15),
             #axis.text.y=element_text(size=15),
             plot.caption=element_text(size=12, hjust=.1, color="#939393"),
             legend.position="bottom",
             plot.margin = margin(t = 20,  # Top margin
                                  r = 50,  # Right margin
                                  b = 40,  # Bottom margin
                                  l = 10), # Left margin
             text=element_text()) + 
#geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
#             arrow = arrow(length = unit(0.1, "cm")))+
  guides(colour = guide_legend(ncol = 6))

ggdraw(ggplot1) + draw_image(obj_img, x = .97, y = .97, 
                               hjust = 1.1, vjust = .7, 
                               width = 0.11, height = 0.1)


🌊 Wordcloud of PRINCIPAL characters that appears in HIMYM


Wordcloud plots are super useful to analyze how many words and the repetition of them appears in a text. Knowing this, we want to do some analysis using wordclouds.

#Remember our second step: tokenize our corpus. 

toks_himym_characters <- tokens(corp_himym, #corpus from all the episodes from all season
                                remove_punct = TRUE, #Remove punctuation of our texts
                                remove_separators = TRUE, #Remove separators of our texts
                                remove_numbers = TRUE, #Remove numbers of our texts
                                remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_keep(c(unique(df_characters_w$name))) #This function allow us to keep just the tokens that we want. 

#In this case, we just want the characters.

toks_himym_characters
## Tokens consisting of 208 documents and 12 docvars.
## "Pilot" :
##  [1] "TED"      "Marshall" "Marshall" "Ted"      "Lily"     "Lily"    
##  [7] "Marshall" "Marshall" "Barney"   "Marshall" "Lily"     "Marshall"
## [ ... and 58 more ]
## 
## "Purple Giraffe" :
##  [1] "TED"    "Robin"  "Barney" "Ted"    "Ted"    "Ted"    "Robin"  "Robin" 
##  [9] "Lily"   "Ted"    "Ted"    "Lily"  
## [ ... and 61 more ]
## 
## "Sweet Taste of Liberty" :
##  [1] "TED"      "Marshall" "Lily"     "Barney"   "Robin"    "Ted"     
##  [7] "TED"      "Barney"   "Ted"      "Marshall" "Lily"     "Marshall"
## [ ... and 54 more ]
## 
## "Return of the Shirt" :
##  [1] "TED"      "TED"      "Barney"   "Robin"    "Robin"    "Ted"     
##  [7] "Ted"      "Lily"     "Lily"     "Marshall" "Marshall" "Lily"    
## [ ... and 18 more ]
## 
## "Okay Awesome" :
##  [1] "TED"      "Robin"    "Marshall" "Lily"     "TED"      "Lily"    
##  [7] "Lily"     "Marshall" "Ted"      "Ted"      "Marshall" "Marshall"
## [ ... and 27 more ]
## 
## "Slutty Pumpkin" :
##  [1] "TED"   "Lily"  "Robin" "Robin" "Ted"   "Ted"   "Ted"   "TED"   "TED"  
## [10] "Ted"   "LILY"  "Ted"  
## [ ... and 24 more ]
## 
## [ reached max_ndoc ... 202 more documents ]
#Remember our third step: DFM object

dfm_general_characters <- toks_himym_characters %>%
                          dfm()

dfm_general_characters
## Document-feature matrix of: 208 documents, 65 features (88.51% sparse) and 12 docvars.
##                           features
## docs                       ted marshall lily barney carl robin ranjit don
##   "Pilot"                   22       18   14      6    5     3      2   0
##   "Purple Giraffe"          30        5    6      4    0    27      0   1
##   "Sweet Taste of Liberty"  27       15    7     10    0     7      0   0
##   "Return of the Shirt"     14        3    3      2    0     7      0   0
##   "Okay Awesome"             9       13    6      7    0     4      0   0
##   "Slutty Pumpkin"          16        5    5      3    0     6      0   0
##                           features
## docs                       mickey gary
##   "Pilot"                       0    0
##   "Purple Giraffe"              0    0
##   "Sweet Taste of Liberty"      0    0
##   "Return of the Shirt"         1    0
##   "Okay Awesome"                0    0
##   "Slutty Pumpkin"              0    1
## [ reached max_ndoc ... 202 more documents, reached max_nfeat ... 55 more features ]
textplot_wordcloud(dfm_general_characters, 
                   rotation = 0.25,
                   min_size = 1.4, max_size = 8,
                   min_count = 1, #Minimum frequency
                   color = brewer.pal(11, "RdBu"))
#RColorBrewer::display.brewer.all()

And here we have it. Our first wordcloud plot. Looks amazing! Remember that you can change the color of it, the sizes and other relevant things.


☄ Wordcloud of SECONDARY characters that appears in HIMYM


Now, let’s do the same, but just with our secondary characters.

#Remember our second step: tokenize our corpus. 

toks_himym_sec_characters <- tokens(corp_himym, #corpus from all the episodes from all season
                                    remove_punct = TRUE, #Remove punctuation of our texts
                                    remove_separators = TRUE, #Remove separators of our texts
                                    remove_numbers = TRUE, #Remove numbers of our texts
                                    remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
  tokens_remove(c("Ted", "Barney", "Lily", "Robin", "Marshall")) #But we remove the principal characters
#Remember our third step: DFM object

dfm_general_sec_characters <- toks_himym_sec_characters %>%
                              dfm()
textplot_wordcloud(dfm_general_sec_characters, 
                   random_order = FALSE, 
                   rotation = 0.25,
                   min_size = 1, max_size =5,
                   labelsize = 1.5,
                   min_count = 1, #Minimum frequency
                   color = RColorBrewer::brewer.pal(8, "Spectral"))

For example, here, we specified another type of palette. The plotted names are just secondary characters because we removed the principal. Zoey, Stella, and Victoria are Ted’s ex-girlfriends.


🔥🌶️ spaCy and spaCyr


spacyr provides a convenient R wrapper around the Python spaCy package. It offers easy access to the following functionality of spaCy. This package is amazing because here what spacyr is doing is clasifying automatically our words into nouns, adjectives, verbs, dates and much more. Of course, it is not 100% accurate, but it is an amazing tool to do some analysis!



#Be patient because it takes around 5-10 minutes to do the installation. Also, please follow the steps marked on your monitor when you are installing the packages. 

library(spacyr)

spacy_install()

spacy_initialize(model = "en_core_web_sm")

sp_parse_doc <- spacy_parse(df_himym_final_doc, tag=TRUE)

This is our output. Look how the package separate our words automatically and also classified them as nouns, verbs, names, etc. It’s amazing!! Of course, the classification is not 100% accurate, but it gives us a good idea for know different things about our texts.

sp_parse_doc
sp_parse_var <- full_join(sp_parse_doc, df_himym_final_doc, by = c("doc_id"))

#In this case, we will just look the proper names and adjectives.

sp_parse_var_PROPN <- sp_parse_var %>% filter(pos=="PROPN" & stringr::str_starts(entity, "PERSON_B"))

sp_parse_var_ADJ <- sp_parse_var %>% filter(pos=="ADJ")

⛄ Get wordcloud using an spaCyr output

We will get a wordcloud using the spacYr output. We will divide the output into adjectives and other features. Please, check the package, you will not regret it.

#Remember our second step: tokenize our corpus. 

toks_himym_ADJ <- tokens(corp_himym, #corpus from all the episodes from all season
                         remove_punct = TRUE, #Remove punctuation of our texts
                         remove_separators = TRUE,  #Remove separators of our texts
                         remove_numbers = TRUE, #Remove numbers of our texts
                         remove_symbols = TRUE) %>%  #Remove symbols of our texts
  tokens_keep(c(unique(sp_parse_var_ADJ$lemma))) %>% #We want to keep all the adjective
  tokens_remove(c(stopwords("english"), "oh", "yeah", "okay", "like", 
                  "get", "got", "can", "one", "hey", "go",
                  "Ted", "Marshall", "Lily", "Robin", "Barney", "just", 
                  "know", "well", "right", "even", "see", 
                  "sure", "back", "first", "said", "maybe", "wedding", 
                  "whole", "wait")) #But we remove stopwords and other words that the package didn't classify it correctly. 
#Remember our third step: DFM object

df_general_ADJ <- toks_himym_ADJ %>%
  tokens_group(groups = Season_w) %>% #group by season
  dfm() %>% dfm_subset(Season < 9)


Look how amazing are the adjectives distributed into the 8 seasons. Unfortunately the function only allows us 8 groups. Every color are the adjectives available in the different seasons.

textplot_wordcloud(df_general_ADJ, 
                   random_order = FALSE, 
                   rotation = 0.25,
                   comparison = TRUE,
                   labelsize = 1.5, 
                   min_count = 1, #Minimum frequency
                   color = ggsci::pal_lancet(palette = "lanonc"))


🎧 Get frequency of adjectives


We will get a frequency of adjectives using the spacYr output. Again, we can do amazing things in terms of analysis.

#Remember our second step: tokenize our corpus. 

freq_gen_dfm <- toks_himym_ADJ %>%
  dfm()
#Generate dataframe
df_freq_gen_dfm <-  as.data.frame(textstat_frequency(freq_gen_dfm, # Our DFM object
                                                     n = 10, #Number of observations displayed
                                                     groups = Season)) #Grouped by season
                                  
df_freq_gen_dfm_match <- df_freq_gen_dfm %>% mutate(total = 1) %>% 
                                  group_by(feature) %>% 
                                  summarise(total = sum(total)) %>% 
                                  filter(total== 9)

df_freq_gen_dfm_final <- right_join(df_freq_gen_dfm, df_freq_gen_dfm_match,
                                   by = "feature") %>% rename(Word = feature) %>% 
                                   mutate(Word = str_to_title(Word))

Look the frequency of adjectives. It’s incredible how the word sorry appears and tends to be the one that our beautiful characters keep using. Maybe they did awful things? You need to see the TV Show and decide by yourself.

ggplot2 <- ggplot(df_freq_gen_dfm_final, aes(x = group, y = frequency, group = Word, color = Word)) +
  geom_line(size = 1.5, show.legend = TRUE) +
  scale_color_manual(values = rev(brewer.pal(n = 7, name = "Dark2"))) +
  geom_point(size = 3.2) +
  theme_minimal(base_size = 14) +
  labs(x = "Number of Season",
       y = "Frequencies of words",
       title = "Frequency of adjectives",
       caption="Description: This plot shows the top adjectives that appears in every season of HIMYM")+
  theme(panel.grid.major=element_line(colour="#cfe7f3"),
        panel.grid.minor=element_line(colour="#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption=element_text(size=12, hjust=.1, color="#939393"),
        legend.position="bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 40,  # Bottom margin
                             l = 10), # Left margin
        text=element_text()) + 
  #geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
  #             arrow = arrow(length = unit(0.1, "cm")))+
  guides(colour = guide_legend(ncol = 4)) +
  gghighlight(max(frequency) > 140,
              keep_scales = TRUE,
              unhighlighted_params = list(colour = NULL, alpha = 0.2))
  

ggdraw(ggplot2) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


⚡ Network plot


How the characters are related to each other? We will find it with the amazing function network plot.

#Remember our second step: tokenize our corpus. 

token_characters_himym <- tokens(corp_himym, #corpus from all the episodes from all season
                                 remove_punct = TRUE, #Remove punctuation of our texts
                                 remove_separators = TRUE, #Remove separators of our texts
                                 remove_numbers = TRUE, #Remove numbers of our texts
                                 remove_symbols = TRUE) %>%  #Remove symbols of our texts
  tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
  #Remember the characters that we web scraped before? Here we are suing that vector to filter characters!
  tokens_tolower() #We want lower cases in our tokens
#Extra step: create a feature co-ocurrence matrix (FCM)

fcm_characters_himym <- token_characters_himym %>%
                        fcm(context = "window", window = 5, tri = FALSE)

🎷 Network plot of all the characters


#Vector with all the characters
v_top_characters <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 70)))

set.seed(100)

textplot_network(fcm_select(fcm_characters_himym, v_top_characters),
                 edge_color = "#008eed", 
                 edge_size = 2, 
                 vertex_labelcolor = "#006fba", 
                 omit_isolated = TRUE,
                 min_freq = .1)

As we expected, the network is around the principal characters, but also we can appreciate how the characters are related to each other. Here, we include every single person that appears on the show. In visualization terms, it could be a mess because we see a lot of lines. Let’s filter this FCM object again to analyze the 30 principal characters according to the frequency.


🎷 Network plot of the principal 30 characters


As we said before, if we want to be more specific, we can reduce our network plot to 30 characters. We will follow the same steps s back but filter to have the top features of the first 30 characters.

#Vector with 30 characters
v_top_characters_2 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 30)))

textplot_network(fcm_select(fcm_characters_himym, v_top_characters_2),
                 edge_color = "#008eed", 
                 edge_size = 5, 
                 vertex_labelcolor = "#006fba",
                 omit_isolated = TRUE,
                 min_freq = .1)


🎸 Network plot of Ted


If we want to be even more specific, we can reduce our network plot and weight it with just one character. In this case, Ted. We are “weighting” this network plot because we want to see the density of how many times that character is related to Ted. We can do this with the previous plots to check how the network changes.

fcm_characters_himym_ted <- token_characters_himym %>%
  tokens_remove(c("marshall", "lily", "barney", "robin")) %>% #Here we just want ted, that why we remove the other principal characters
  fcm(context = "window", window = 5, tri = FALSE)

#Vector with 30 characters
v_top_characters_3 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym_ted, 30)))

#Create a FCM matrix with our characters
vertex_size_f <- fcm_select(fcm_characters_himym_ted, pattern = v_top_characters_3)

#Create a proportion 
v_proportion <- rowSums(vertex_size_f)/min(rowSums(vertex_size_f))

#Vector of Ted
x_p <- c("ted")

#Replace that proportion in our vector
final_v <- replace(v_proportion, names(v_proportion) %in% x_p, 
                   v_proportion[names(v_proportion) %in% x_p]/15)
textplot_network(fcm_select(fcm_characters_himym_ted, v_top_characters_3),
                 edge_color = "#008eed", 
                 edge_size = 5, 
                 vertex_labelcolor = "#006fba",
                 omit_isolated = TRUE,
                 vertex_labelsize = final_v,
                 min_freq = .1)


🎹 Text stat collocation


Identify and score multi-word expressions, or adjacent fixed-length collocations, from text using textstat_collocations().

We want to see which phrases are the more used ones in the context of the first season. This is a simple step to understanding how vital some compound phrases can be.

#Remember our second step: tokenize our corpus. 

toks_himym_s1 <- tokens(corp_himym_s1_simil, #Define our corpus for the first season
                        padding = TRUE) %>% #Leave an empty string where the removed tokens previously existed
  tokens_remove(stopwords("english")) #Remove stopwords of our token
himym_s1_collocations <-textstat_collocations(toks_himym_s1, #Our token object
                                              tolower = F) #Keep capital letters


df_himym_s1_coll <- data.frame(himym_s1_collocations) %>% 
                        rename(`Total of collocations` = count)

Good! look what collocations are like right now, get married and party number are the most used ones in the first season. The Lambda and Z statistics are metrics used to plot the different allocations. Every dot in this graph represents one compound phrase. The size of each dot means how many times the characters said that phrase.

ggplot3 <- ggplot(df_himym_s1_coll, aes(x = z, y = lambda, label = collocation)) +
  geom_point(alpha = 0.2, aes(size = `Total of collocations`), color = "#00578a")+
  geom_point(data = df_himym_s1_coll %>% filter(z > 15), 
             aes(x = z, y = lambda, size = `Total of collocations`),
             color = '#00578a') + 
  geom_text_repel(data = df_himym_s1_coll %>% filter(z > 15), #Function from ggrepel package. Show scatterplots with text.
                  aes(label = collocation, size = count), size = 3,
                  box.padding = unit(0.35, "lines"),
                  point.padding = unit(0.3, "lines")) + 
  scale_y_continuous(breaks = seq(0, 16, by = 1), limits = c(0,16))+
  theme_minimal(base_size = 14) +
  labs(x = "Z statistic",
       y = "Lambda",
       title = "Allocations of words in the First Season",
       caption = "Description: This plot identifies and scores multi-word expressions of the 1st season")+
  theme(panel.grid.major = element_line(colour = "#cfe7f3"),
        panel.grid.minor = element_line(colour = "#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption = element_text(size=12, hjust=.1, color="#939393"),
        legend.position="bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 10,  # Bottom margin
                             l = 10))

ggdraw(ggplot3) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


🎻 Locate keywords-in-context


What about now to look the most iconics phrases in HIMYM. It’s going to be…wait for it…legendary. We can do that with locate keywords in context.

#Set dataframe to merge with other information--------------------------

df_title_s_chp <- df_himym_final_doc %>% 
                  select(Title, Season, Chapter, No.overall, 
                         Season_w, US.viewers.millions.)

#First step: Define a corpus --------------------------------------

corp_himym <- corpus(df_himym_final_doc)  # build a new corpus from the texts

docnames(corp_himym) <- df_himym_final_doc$Title #Rename docnames with Title of the episode

corp_himym_s5 <- corpus_subset(corp_himym, #our corpus
                               Season == 5) #Filter by season


An example with the word: Love. Because this TV Show talks about love, let’s find this word in the context of different episodes. We will use the fifth season to locate the word in context “Love”.

toks_himym_s5 <- tokens(corp_himym_s5, #Corpus of season 5
                        padding = TRUE)
kw_himym_s5_love <- kwic(toks_himym_s5, #token object.
                         pattern = "love*", #pattern that we want to look for.
                         window = 10) #how many words you want before and after your pattern.
df_kw_himym_s5_love <- as.data.frame(kw_himym_s5_love)  %>% 
  rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>% 
  rename_with(str_to_title, .cols = everything()) %>%  left_join(df_title_s_chp, 
                                                                 by ="Title") %>% 
  relocate(Title, Season, Chapter)

df_kw_himym_s5_love


That’s amazing: it’s seems that the word love appears 150 times just in the fifth season.


Now, let’s do another example with the word: legendary. We will search this word but for all seasons.

toks_himym <- tokens(corp_himym,  #Define our corpus for all seasons
                     padding = TRUE) #Leave an empty string where the removed tokens previously existed
kw_himym_legendary <- kwic(toks_himym, #token object.
                           pattern = "legendary*",  #pattern that we want to look for.
                           window = 10) #how many words you want before and after your pattern.
df_kw_himym_legendary <- as.data.frame(kw_himym_legendary)  %>% 
  rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>% 
  rename_with(str_to_title, .cols = everything()) %>%  left_join(df_title_s_chp, 
                                                                 by = "Title") %>% 
  relocate(Title, Season, Chapter)

df_kw_himym_legendary


Mmmhhh, we tought that the word legendary was going to appeared more. Maybe they didn’t mention that so often.


Also, we can even do phrases like: Wait for it. But don’t worry, you don’t need to wait us. We are here.

kw_himym_wait_for <- kwic(toks_himym, #token object.
                          pattern = phrase("wait for it"),  #Here we can specify even a phrase
                          window = 10) #how many words you want before and after your pattern.
df_kw_himym_wait_for <- as.data.frame(kw_himym_wait_for)  %>% 
  rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>% 
  rename_with(str_to_title, .cols = everything()) %>%  left_join(df_title_s_chp, 
                                                                 by = "Title") %>% 
  relocate(Title, Season, Chapter)


df_kw_himym_wait_for

If you can see, according to our rows, the phrase wait for it appears 80 times thorugh the different episodes.


🔎 Sentiment analyis


EXTRA: just because we were having a lot of fun with this package. We are going to do a quick sentiment analysis.

toks_himym <- tokens(corp_himym, #Our corpus object
                     remove_punct = TRUE, #Remove punctuation in our texts
                     remove_separators = TRUE, #Remove separators in our texts
                     remove_numbers = TRUE, #Remove numbers in our texts
                     remove_symbols = TRUE) %>% #Remove symbols in our texts
  tokens_remove(stopwords("english"))#Add additional words

#tidy_sou <- df_himym_final_doc %>%
#  unnest_tokens(word, text) This is another way on spacyr


We will use the get_sentiments functions from the tidytext package to get positive and negative words. We have four sources. We are going to use bing, but you can choose the one that you like the most. What we are obtaning is just string vectors with negative and positive words.

df_positive_words <- get_sentiments("bing") %>% #We have four options: "bing", "afinn", "loughran", "nrc" 
  filter(sentiment == "positive")

df_negative_words <- get_sentiments("bing") %>%
  filter(sentiment == "negative")


We must define a dictionary to put it into a dictionary and pass it thorugh a dfm object. We know that you are an expert on that now.

#Define a dictionary with positive and negative words from bing --------------------------------------

l_sentiment_dictionary <- dictionary(list(positive = df_positive_words, 
                                        negative = df_negative_words))


💡 Warning: this functions takes 30 minutes: be patience. Don’t worry, we will charge the dataframe for you.

dfm_sentiment_himym <- dfm(toks_himym) %>% dfm_lookup(dictionary = sentiment_dictionary)


We will charge the document for you from the repo that you download it. We got you.

##Load a file
#It is a DFM object, which comes from a token off all the season of HIMYM

load(file = "data/dfm_sentiment_himym.Rdata")

#Rename doc:id with the Titles of every episode
docnames(dfm_sentiment_himym) <- df_himym_final_doc$Title


We will give a format to our dataframe.

#Format in long to plot positive and negative words
df_sentiment_himym <- convert(dfm_sentiment_himym, "data.frame") %>% 
  gather(positive.word, negative.word, key = "Polarity", value = "Words") %>% 
  rename(Title = doc_id) %>% 
  mutate(Title = as_factor(Title)) %>% 
  left_join(df_title_s_chp, by ="Title") %>%
  mutate(Polarity = replace(Polarity, is.character(Polarity), 
                            str_replace_all(Polarity, 
                                            pattern = "negative.word",
                                            replacement = "Negative words")),
         Polarity = replace(Polarity, is.character(Polarity), 
                            str_replace_all(Polarity, 
                                            pattern = "positive.word",
                                            replacement = "Positive words")))

ggplot4 <- ggplot(df_sentiment_himym, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) + 
  geom_bar(stat = 'identity', position = position_dodge(), size = 1) + 
  facet_wrap(~ Season_w)+
  scale_fill_manual(values = c("#c6006f", "#004383")) + 
  scale_y_continuous(breaks = seq(0, 250, by = 50))+
  theme_minimal(base_size = 14) +
  labs(x = "Episodes",
       y = "Frequency of words",
       title = "Total of positive and negative words per season",
       caption="Description: This plot identifies total of positive and negative words \n per season and episode")+
  theme(panel.grid.major = element_line(colour="#cfe7f3"),
        panel.grid.minor = element_line(colour="#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
        legend.position = "bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 10,  # Bottom margin
                             l = 10))

ggdraw(ggplot4) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)

Look the total (raw) words between positive and negative words per season. This is a good metric, but it doesn’t tell us that much. At least, we know how many positive/begative worsd are used in the episodes.


🗞 Weight the feature frequencies in a dfm


dfm_weight() We can be more fair. Let’s know calculate the weight of the words.

This step is the same as the last one, but here we are taking into account the weights to do a fair comparison.

dfm_sentiment_himym_prop <- dfm_weight(dfm_sentiment_himym, scheme = "prop")
dfm_sentiment_himym_prop
## Document-feature matrix of: 208 documents, 4 features (47.72% sparse) and 11 docvars.
##                           features
## docs                       positive.word positive.sentiment negative.word
##   "Pilot"                      0.6621005                  0     0.3378995
##   "Purple Giraffe"             0.6722222                  0     0.3277778
##   "Sweet Taste of Liberty"     0.6510417                  0     0.3489583
##   "Return of the Shirt"        0.5977011                  0     0.4022989
##   "Okay Awesome"               0.6257310                  0     0.3742690
##   "Slutty Pumpkin"             0.6267281                  0     0.3732719
##                           features
## docs                       negative.sentiment
##   "Pilot"                                   0
##   "Purple Giraffe"                          0
##   "Sweet Taste of Liberty"                  0
##   "Return of the Shirt"                     0
##   "Okay Awesome"                            0
##   "Slutty Pumpkin"                          0
## [ reached max_ndoc ... 202 more documents ]


We repeat the same proces. It seems that HIMYM is positive after all. Amazing.

df_sentiment_himym_prop <- convert(dfm_sentiment_himym_prop, "data.frame") %>% 
  gather(positive.word, negative.word, key = "Polarity", value = "Words") %>% 
  rename(Title = doc_id) %>% 
  mutate(Title = as_factor(Title)) %>% 
  left_join(df_title_s_chp, by = "Title") %>%
  mutate(Polarity = replace(Polarity, is.character(Polarity), 
                            str_replace_all(Polarity, 
                                            pattern = "negative.word",
                                            replacement = "Negative words")),
         Polarity = replace(Polarity, is.character(Polarity), 
                            str_replace_all(Polarity, 
                                            pattern = "positive.word",
                                            replacement = "Positive words")))

### 14.07.02.- Plot total of positive and negative words per season and episode -----

#This step is the same as the last one, but here we are taking into account the weights to do a fair comparison

ggplot5 <- ggplot(df_sentiment_himym_prop, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) + 
  geom_bar(stat = 'identity', position = position_dodge(), size = 1) + 
  facet_wrap(~ Season_w) +
  scale_fill_manual(values = c("#c6006f", "#004383")) + 
  scale_y_continuous(breaks = seq(0, .8, by = .2))+
  theme_minimal(base_size = 14) +
  labs(x = "Episodes",
       y = "Frequency of words",
       title = "Weighted positve and negative words per season",
       caption = "Description: This plot identifies the weighted total of positive and negative words \n per season and episode")+
  theme(panel.grid.major = element_line(colour = "#cfe7f3"),
        panel.grid.minor = element_line(colour = "#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
        legend.position = "bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 10,  # Bottom margin
                             l = 10))

ggdraw(ggplot5) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


📜 Wrangle dfm weight dataframe with measures


Now let’s do a rate to check in which episodes it can be more a negative context. We will use `Scaling Policy Preferences from Coded Political Texts from WILL LOWE, KENNETH BENOIT, SLAVA MIKHAYLOV, MICHAEL LAVER (2010).


They use a balance between positive words/negative words using a log scale, which you can see on the code.


Here is their formula to get the proportion:

\[ \Theta^{(L)}=log\frac{R + .5}{N + .5} \] Where R = positive words and N = negative words.

#Here we applied the formula proposed before. 

df_sentiment_himym_prop_measure <- convert(dfm_sentiment_himym_prop, "data.frame") %>% 
  rename(Sentiment = positive.word)  %>% rename(Title = doc_id) %>% 
  left_join(df_title_s_chp, by = "Title")  %>%
  mutate(measure = log((Sentiment + 0.5)/(negative.word + .5))) %>%
  select(-Season) %>% 
  rename(Season = Season_w)
dfm_sentiment_himym_prop <- dfm_weight(dfm_sentiment_himym, scheme = "prop")
dfm_sentiment_himym_prop
## Document-feature matrix of: 208 documents, 4 features (47.72% sparse) and 11 docvars.
##                           features
## docs                       positive.word positive.sentiment negative.word
##   "Pilot"                      0.6621005                  0     0.3378995
##   "Purple Giraffe"             0.6722222                  0     0.3277778
##   "Sweet Taste of Liberty"     0.6510417                  0     0.3489583
##   "Return of the Shirt"        0.5977011                  0     0.4022989
##   "Okay Awesome"               0.6257310                  0     0.3742690
##   "Slutty Pumpkin"             0.6267281                  0     0.3732719
##                           features
## docs                       negative.sentiment
##   "Pilot"                                   0
##   "Purple Giraffe"                          0
##   "Sweet Taste of Liberty"                  0
##   "Return of the Shirt"                     0
##   "Okay Awesome"                            0
##   "Slutty Pumpkin"                          0
## [ reached max_ndoc ... 202 more documents ]


Plot measure of positivity among seasons

Woooow! We confirm that is a positive Show, but it’s interesting how certain episodes, mostly from the last season, have a negative context. This total makes sense because by that time Lily was fighting with Marshall for their baby and Robin, Ted and Barney were having problems (love triangle).

df_sentiment_himym_prop <- convert(dfm_sentiment_himym_prop, "data.frame") %>% 
  gather(positive.word, negative.word, key = "Polarity", value = "Words") %>% 
  rename(Title = doc_id) %>% 
  mutate(Title = as_factor(Title)) %>% 
  left_join(df_title_s_chp, by = "Title") %>%
  mutate(Polarity = replace(Polarity, is.character(Polarity), 
                            str_replace_all(Polarity, 
                                            pattern = "negative.word",
                                            replacement = "Negative words")),
         Polarity = replace(Polarity, is.character(Polarity), 
                            str_replace_all(Polarity, 
                                            pattern = "positive.word",
                                            replacement = "Positive words")))




ggplot6 <- ggplot(df_sentiment_himym_prop_measure, aes(x = No.overall, y = measure, 
                                            color = Season, group = Season)) +
  scale_color_manual(values = brewer.pal(n = 9, name = "Set1"))+
  geom_line(size = 1.5) +
  geom_point(size = 3.2) + 
  scale_x_continuous(breaks = seq(0, 208, by = 20))+
  theme_minimal(base_size = 14) +
  labs(x = "Number of episode",
       y = "Rate",
       title = "Measure of positivity among episodes",
       caption="Description: This plot shows the positivity rate of every episode")+
  theme(panel.grid.major = element_line(colour = "#cfe7f3"),
        panel.grid.minor = element_line(colour = "#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        plot.caption = element_text(size=12, hjust = .1, color = "#939393"),
        legend.position = "bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 40,  # Bottom margin
                             l = 10), # Left margin
        text = element_text()) + 
  guides(colour = guide_legend(ncol = 3)) +
  geom_hline(yintercept = 0, linetype = "dashed", 
             color = "red", size = 1)


ggdraw(ggplot6) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


Thanks for your attention and we hope you find this material useful. If you have any question, please reach us. You have all of our information on the repo.

Have a good day!

 

A work by Jorge Roa, Augusto Fonseca & Alexander KRaess

Prepared for Intro to Data Science Workshop 2022

Hertie School, Berlin